Abstract: The paper studies the problem of securely
storing biometric passwords, such as fingerprints and
irises. With the help of coding theory, Juels and Wattenberg
derived in 1999 a scheme in which similar input strings
are accepted as the same biometric while, at the same time,
nothing can be learned from the stored data. They called
their scheme a fuzzy commitment scheme.

In this paper we will revisit the solution of Juels and 
Wattenberg and we will provide answers to two important 
questions: what type of error-correcting codes should be
used, and what happens if biometric templates are not
uniformly distributed, i.e. the biometric data come with
redundancy?

Answering the first question will lead us to the search for 
low-rate large-minimum distance error-correcting codes 
which come with efficient decoding algorithms up to the 
designed distance. 

To answer the second question we relate the
required rate to a quantity connected to the "entropy"
of the string, and we estimate a sort of "capacity"
in the spirit of the converse of Shannon's noisy
coding theorem.

Finally we deal with side-problems arising in a practical
implementation and we propose a possible solution to the
main one which, as far as we know, has so far prevented
real-life applications of the fuzzy scheme.

I. Introduction 

Traditionally passwords for access to a computer are 
not stored in plain-text but rather as images under a 
hash function. Hash functions have the property that
they can easily be computed for any input string but
it is computationally infeasible to compute any preimage
of a given image point. Usually it is also desirable
that hash functions are 'collision resistant', meaning that
it is computationally infeasible to come up with two
different input strings which are mapped to the same
hash value. Because of the last property, standard hash
functions such as SHA-1 are not suitable for storing biometric
data. What we would need is a hash function with
the property that similar input strings result in the
same hash value. Until recently no good scheme was
known, and many practical systems that use biometric
data such as fingerprints to access a personal computer
store them in plain-text.
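As a small illustration of why an ordinary cryptographic hash cannot be applied directly to noisy readings, the following Python sketch hashes two byte strings that differ in a single bit. The two "readings" are made-up placeholders, and SHA-256 from the standard library is used in place of SHA-1 purely for convenience.

```python
import hashlib

def hamming_distance_bits(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Two hypothetical 32-byte "biometric readings" that differ in exactly one bit.
reading_1 = bytes(range(32))
reading_2 = bytes([reading_1[0] ^ 0x01]) + reading_1[1:]

digest_1 = hashlib.sha256(reading_1).digest()
digest_2 = hashlib.sha256(reading_2).digest()

input_distance = hamming_distance_bits(reading_1, reading_2)
digest_distance = hamming_distance_bits(digest_1, digest_2)
print(input_distance, digest_distance)
```

For a well-behaved hash one expects roughly half of the 256 digest bits to differ, so an exact comparison of digests rejects a reading that is off by a single bit.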

Martinian, Yekhanin and Yedidia [21] call the problem
at hand the secure biometric storage problem. The problem
arises when biometrics such as fingerprints and irises
are used instead of passwords. It is desirable for security
reasons that the biometric data is not stored in plain-text
on a storage device but rather in encrypted form. When a
user wants to access the system, the access device should
grant access as long as the two biometrics do not differ in
more than a certain number of bits.

In the literature there are several schemes which use 
ideas from coding theory to tackle the secure biometric 
storage problem. According to the authors of [21] the
first solution was proposed by Davida, Frankel and
Matt in [6]. In their own paper [21] Martinian et al.
propose an information theoretic solution based on the
Slepian-Wolf theorem. This system has the property that
the biometric is securely stored; it has, however, the
disadvantage that a person who has access to the stored
data and the implemented algorithm can compute a bit
string which will provide access to the system even
though the bit string is not close to any biometric data.

In this paper we will be concerned with an algorithm
first proposed by Juels and Wattenberg [17]. This
system also makes heavy use of coding theory.

The paper is structured as follows: In the next section 
we revisit the algorithm of Juels and Wattenberg. The 
original paper [17] leaves two important questions open.
First, which practical codes of very large block length
provide the robustness and security level required for
the secure biometric storage problem? We provide answers
to this question in Section III. The second question arises
when the possible biometric bit-strings are not uniformly
distributed. This is an important issue, as all practical
systems suffer from this problem. We will address it
in Section IV.

II. The Fuzzy Commitment Scheme of Juels and Wattenberg

Juels and Wattenberg [17] proposed a 'fuzzy commitment
scheme' capable of storing biometric data in
binary form. In this section we describe the scheme for
data over a general alphabet and we derive a strengthened
theorem.

Let $\mathbb{F} = \mathbb{F}_q$ be a finite field. We assume that the
biometric data is given in the form of a vector $b \in \mathbb{F}^n$.
Assume $C \subset \mathbb{F}^n$ is an $[n, k, d]$ linear code whose distance
$d$ is given by

$$d = 2t + 1.$$

We also assume that there is an efficient decoding 
algorithm capable of decoding up to t errors. 

Let $h : \mathbb{F}^n \to \mathbb{F}^l$ be a hash function. In particular $h$
should be collision resistant and it should be computationally
infeasible to compute an $x \in h^{-1}(y)$ for any
$y \in \mathbb{F}^l$.

Let $b \in \mathbb{F}^n$ be the biometric one wants to store on
the computer. The algorithm requires selecting a random
codeword $r_b \in C$. The system then computes the vector

$$l := b - r_b$$

and stores on the system:

$$(h(r_b), l).$$

The following is a strengthening of the main theorem
in [17].

Theorem 1: If the possible biometrics $b \in \mathbb{F}^n$ are
uniformly distributed, then computing the biometric $b \in
\mathbb{F}^n$ from the stored data $(h(r_b), l)$ is computationally
equivalent to inverting the 'restricted' hash function

$$h|_C : C \to \mathbb{F}^l.$$

Proof: Since $b$ and $r_b$ were selected independently
and uniformly at random, the vector $l := b - r_b$ reveals
no information about the random choice of $r_b \in C$. An
attacker is left with the task of computing $r_b$ from $h(r_b)$. ■

The theorem provides the means to build a
practical secure storage system once we can assume that
the biometrics are uniformly distributed over the ambient
space $\mathbb{F}^n$. If this is the case and if $h$ is a hash function
which is practically secure, then we only have to require
that the size of the code satisfies $|C| > 2^{80}$. This is due to the
fact that it is generally accepted that a total search space
of $2^{80}$ is beyond the capabilities of modern computers.
As a result it is desirable that the constructed codes have
dimension $k = \dim C > 80$.

The following lemma shows that the system accepts
an authorized user as soon as this user provides
a biometric vector which comes close enough to the
originally supplied vector $b \in \mathbb{F}^n$.

Lemma 2: Let $\tilde{b} \in \mathbb{F}^n$ be a vector whose Hamming
distance from $b$ satisfies:

$$d_H(\tilde{b}, b) \le t.$$

Then it is possible to efficiently compute $b$ from the
stored data $(h(r_b), l)$. (In fact authorization is granted by
comparing the stored hash with the hash of the decoded
codeword, without any need to compute $b$.)

Proof:

$$d_H(r_b, \tilde{b} - l) = d_H(b - l, \tilde{b} - l) = d_H(b, \tilde{b}) \le t.$$

The vector $\tilde{b} - l$ decodes by assumption uniquely to
the code vector $r_b$. Knowing $r_b$ and $l$ is equivalent to
knowing $b$. ■
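The enrolment and verification steps described in this section can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the construction recommended in the paper: the field is $\mathbb{F}_2$, the code $C$ is a blockwise repetition code chosen only because its decoder is trivial, SHA-256 stands in for the hash $h$, and all function names are hypothetical.

```python
import hashlib
import secrets

REP = 5                 # repetition factor: d = 5, so t = 2
T = (REP - 1) // 2
K = 8                   # toy dimension (the text asks for k > 80)
N = K * REP             # block length n

def encode(message):
    """Map K message bits to an N-bit codeword by repeating each bit REP times."""
    return [bit for bit in message for _ in range(REP)]

def decode(word):
    """Nearest-codeword decoding by majority vote in every block of REP bits."""
    blocks = [word[i:i + REP] for i in range(0, N, REP)]
    return encode([1 if 2 * sum(block) > REP else 0 for block in blocks])

def h(codeword):
    """Stand-in for the hash h: SHA-256 of the codeword, one byte per bit."""
    return hashlib.sha256(bytes(codeword)).hexdigest()

def xor(u, v):
    """Addition and subtraction coincide over F_2."""
    return [x ^ y for x, y in zip(u, v)]

def commit(b):
    """Return the stored pair (h(r_b), l) with l = b - r_b."""
    r_b = encode([secrets.randbelow(2) for _ in range(K)])
    return h(r_b), xor(b, r_b)

def verify(stored, b_tilde):
    """Accept iff b_tilde - l decodes to a codeword with the stored hash."""
    stored_hash, offset = stored
    return h(decode(xor(b_tilde, offset))) == stored_hash

b = [secrets.randbelow(2) for _ in range(N)]      # enrolled biometric
stored = commit(b)

close = list(b)
close[0] ^= 1
close[7] ^= 1                                     # Hamming distance 2 = t
far = [bit ^ 1 for bit in b]                      # Hamming distance n

print(verify(stored, close), verify(stored, far))
```

The reading at distance $t$ is accepted, while the complemented reading decodes to a different codeword and is rejected. Note that `verify` never reconstructs $b$; it only compares hashes, as remarked in Lemma 2.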

Several considerations are due at this moment, starting 
with the choice of the code to use. 

In [17] it is proposed that Reed-Solomon and BCH
codes might provide useful results (see also [14]). We
believe these are not necessarily good options, for two
reasons. First, practical biometric systems often have to
deal with a large number of bits (an estimate in some
circumstances could be 10'000 bits). Moreover, an error
tolerance of 10% is a reasonable
requirement. BCH codes of block length $10^4$ and distance
2'000 are necessarily of very low rate, and it is
practically infeasible to run e.g. a Berlekamp-Massey
algorithm once so many syndromes are involved.

The next section addresses the choice of the code. 

III. Choice of the code 

Based on the comments in the last section we require 
an [n, k, d] linear code whose dimension is k > 80 over 
the binary field, possibly smaller if one works over larger 
alphabets. In addition one wants to have a large relative 
minimum distance that only low rate codes can afford. 
Indeed because e.g. of the asymptotic Elias upper bound 
(see e.g. HI) only very low rate binary codes can have 
relative distance larger than e.g. 0.4. Of course the code 
should come with efficient decoding algorithms even 
when the block length is in the range of $n = 10^4$.



We think of two types of codes as possible candidates 
for this application, namely 1) Product codes, and 2) 
LDPC codes. Both these codes can be decoded with 
linear or close to linear complexity in the block length. 

Let us consider the first option: products of classical
codes. We can define them using generator matrices
(see e.g. [20]): if $A$ and $B$ are the generator matrices of
two codes, $C_1$ and $C_2$, with parameters $(n_1, k_1, d_1)$ and
$(n_2, k_2, d_2)$, then the Kronecker product of matrices

$$A \otimes B = (a_{ij} B),$$

obtained by replacing every entry $a_{ij}$ of $A$ by $a_{ij} B$, is
the generator matrix of the product code.

The new code has parameters $(n_1 n_2, k_1 k_2, d_1 d_2)$ and
can be viewed as the set of all codewords consisting
of $n_1 \times n_2$ arrays constructed in such a way that every
column is a codeword of the first code and every row is
a codeword of the second one.

Clearly, given the definition of the product of two 
codes, the product of more than two codes can be defined 
as well. 
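To make the Kronecker-product definition concrete, here is a minimal pure-Python sketch over the binary field. The two component codes (a length-3 repetition code and a length-3 single parity-check code) are toy choices made for this illustration, and the minimum distance is found by brute force, which is only feasible for very small dimensions.

```python
from itertools import product

def kronecker(a, b):
    """Kronecker product (a_ij * B) of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]

def codewords(generator):
    """All codewords of the binary code spanned by the rows of `generator`."""
    n = len(generator[0])
    words = []
    for coeffs in product([0, 1], repeat=len(generator)):
        word = [0] * n
        for c, row in zip(coeffs, generator):
            if c:
                word = [(w + r) % 2 for w, r in zip(word, row)]
        words.append(tuple(word))
    return words

def parameters(generator):
    """Return (n, k, d) by brute force; assumes linearly independent rows."""
    d = min(sum(w) for w in codewords(generator) if any(w))
    return len(generator[0]), len(generator), d

A = [[1, 1, 1]]                 # (3, 1, 3) repetition code
B = [[1, 0, 1], [0, 1, 1]]      # (3, 2, 2) single parity-check code
G = kronecker(A, B)

print(parameters(A), parameters(B), parameters(G))
```

For this toy pair the brute-force search gives a product code with parameters (9, 2, 6), in agreement with the multiplicative rule for length, dimension and distance.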

We give here some examples of products of two codes
with parameters getting close to (100000, 100, 20000):

• (512,98,93), a classical Goppa code and
(200, 1, 200), a repetition code;

• (121,49,37), an extended Goppa code [28] and
(825,2,550), where codewords are the all-zero 
codeword, two codewords with respectively the first 
and the last 275 bits equal to ones and the other 
zeroes, and the sum of these two; 

• (144,50,48), an extended Goppa code and 
(693,2,462), where codewords are the all-zero 
codeword, two codewords with respectively the first 
and the last 231 bits equal to ones and the other 
zeroes, and the sum of these two; 

• (256,26,116), an extended Goppa code and 
(400,4,200), an (8,4,4) extended Hamming code 
with each symbol repeated 50 times. 
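The parameters of the four products listed above follow directly from the rule that length, dimension and distance multiply. The short Python check below does the arithmetic; the function name is introduced here only for illustration.

```python
def product_parameters(c1, c2):
    """Parameters of the product of two codes given as (n, k, d) triples."""
    (n1, k1, d1), (n2, k2, d2) = c1, c2
    return (n1 * n2, k1 * k2, d1 * d2)

examples = [
    ((512, 98, 93), (200, 1, 200)),
    ((121, 49, 37), (825, 2, 550)),
    ((144, 50, 48), (693, 2, 462)),
    ((256, 26, 116), (400, 4, 200)),
]

results = [product_parameters(c1, c2) for c1, c2 in examples]
for (c1, c2), (n, k, d) in zip(examples, results):
    print(c1, "x", c2, "->", (n, k, d), f"rate={k / n:.5f}", f"d/n={d / n:.3f}")
```

The resulting triples are (102400, 98, 18600), (99825, 98, 20350), (99792, 100, 22176) and (102400, 104, 23200), all with rate about 0.001 and relative distance between roughly 0.18 and 0.23.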

The decoding procedure of such product codes is
based on iterative algorithms, where one decodes alternately
by columns and by rows (see also [23], [24], [25]).
Thanks to this kind of splitting in the decoding, we can
afford to use classical codes such as Goppa codes, while
maintaining a reasonable computational complexity.
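When one component is a repetition code, as in the first example above, every row of the array is a repeated symbol, so the row step reduces to a majority vote. The Python sketch below shows only that step on a toy 4 x 5 array; it is a simplification for illustration and not the iterative algorithms of the cited references, and the column decoder (a Goppa decoder in the example) is not implemented here.

```python
def majority_vote_rows(array):
    """Decode each row of a 0/1 array as a repetition codeword by majority vote."""
    return [1 if 2 * sum(row) > len(row) else 0 for row in array]

# Toy stand-in: a length-4 column word, each symbol repeated 5 times along its row.
column_word = [1, 0, 1, 1]
received = [[bit] * 5 for bit in column_word]
received[0][2] ^= 1      # one error in row 0
received[1][0] ^= 1      # two errors in row 1
received[1][4] ^= 1

recovered = majority_vote_rows(received)
print(recovered == column_word)
```

The recovered column word would then be handed to the decoder of the other component code, which only has to deal with the rows where the majority vote failed.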

Since the first version of our paper was made available
on the arXiv, a similar choice of coding scheme has been
proposed in [15].

As for LDPC codes, the difficulty seems mainly that
of finding the parameters we need. Codes studied in the
literature often aim at rates of 1/2 or higher. Such codes
necessarily have a relatively poor relative minimum
distance.

Among the many constructions in the literature, we
believe that RA, IRA and eIRA codes (see for example
[29], ED) should be good candidates in this
respect. We have also taken into consideration the use
of algebraic constructions of LDPC codes, such as the
Margulis-Ramanujan type [30]: in this case we should
modify the construction to lower the rate, for example
by taking m + 1 copies of the graph on the left and m
on the right for a suitable m, but we face the difficulty
of finding a good minimum distance [19].

Actually turbo codes could be a better option for a low
rate, though in more practical scenarios, as we will see in
the next section, such low rates are no longer convenient
for security reasons and more standard parameters are
better suited.

IV. Distribution of biometric templates 

Theorem 1 works under the strong assumption that
the biometric data is uniformly and randomly distributed
over the ambient space $\mathbb{F}^n$. In practical applications this
is a very unlikely scenario. In this section we estimate a
threshold for the dimension $k$ of the code, above which
the commitment scheme of Juels and Wattenberg is most
probably secure.

First note that if one has some information about the
biometric $b$, it will be possible to recover from $l$ some
information about $r_b$. Depending on the size of $C$, it might
be possible to search among all codewords with a
particular pattern and consequently break the system. To
defend the system against this attack, one could
essentially take a higher rate code (but at the expense
of lowering the minimum distance). So our next step is
to relate the uncertainty or randomness connected to the
string with the dimension required for the code.

Following [4], we can speak of the entropy of a binary
string as the log in base 2 of the number of possible
strings: so, for example, for a binary string of length $n$,
where each bit is chosen independently and randomly
between 0 and 1, the entropy is defined to be $n$ and it
is measured in units of information or Shannon bits (see
e.g. [10]). If the string is not random, the entropy is the
log of the number of so-called typical sequences; if,
for example, each bit is chosen independently to be 1
with probability $p$ and 0 with probability $q = 1 - p$, then the
entropy of the string is $nh(p)$, where $h(p) = -[p \log p +
q \log q]$ is the Shannon function.
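As a quick numerical companion to the definition, the following Python sketch evaluates the Shannon function with base-2 logarithms and the resulting entropy $nh(p)$ of a string of independent, identically biased bits. The function names and the sample values of $p$ are chosen here for illustration only.

```python
import math

def binary_entropy(p):
    """Shannon function h(p) = -[p log2 p + q log2 q] with q = 1 - p, in bits."""
    if p in (0.0, 1.0):
        return 0.0          # limiting value, since x log x -> 0 as x -> 0
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))

def string_entropy(n, p):
    """Entropy n * h(p) of n independent bits, each equal to 1 with probability p."""
    return n * binary_entropy(p)

for p in (0.5, 0.25, 0.1):
    print(p, round(binary_entropy(p), 4), round(string_entropy(10_000, p), 1))
```

For unbiased bits the entropy equals the length $n$, and it drops as soon as the bits are biased, which is the redundancy the rest of this section accounts for.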

Now, let $H(b)$ be the entropy of the biometric. If it
is $n$, meaning that biometrics are randomly distributed,
then we can afford a code with dimension $k = k_0$ ($k_0 =
100$, say). When the distribution is not really random,
the "number of possible strings" is reduced from
$2^n$ to $2^{H(b)}$.

So, roughly speaking, it is as if the eavesdropper Eve
knew the correct bit at $n - H(b)$ positions, so that if we
want her to search nevertheless among $2^{k_0}$ codewords,
then, counting in the worst case over all possible strings
for those positions, we would need $2^{k_0} \cdot 2^{n - H(b)}$
codewords, i.e. the dimension should be

$$k \ge k_0 + (n - H(b)).$$
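The dimension requirement is easy to evaluate numerically. The Python sketch below reads the bound as non-strict and returns the smallest integer dimension that satisfies it; the function name, the default $k_0 = 100$ and the sample entropies are assumptions made for this illustration.

```python
import math

def required_dimension(n, entropy_bits, k0=100):
    """Smallest integer k with k >= k0 + (n - H(b)), as in the worst-case bound."""
    return math.ceil(k0 + (n - entropy_bits))

n = 10_000
for entropy_bits in (10_000, 9_500, 5_000):
    k = required_dimension(n, entropy_bits)
    print(entropy_bits, k, f"rate={k / n:.3f}")
```

With full entropy the bound reduces to $k_0$, a loss of 500 bits of entropy raises the requirement to 600, and for an entropy of half the length the required rate already exceeds 1/2, which illustrates the tension with the large minimum distance asked for earlier.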

Clearly, as said, we are considering a worst-case
scenario, so this requirement makes sense for, let's say,
reasonable values of the parameters, that is $k_0 \ll H(b)$;
otherwise $k$ could be required to be even larger than $n$.
Essentially our requirement purposely asks for a bit more
than the strict minimum, which does no harm
from a security point of view.

To see the issue from another viewpoint, we can think
of a channel with the message at one end
and, at the other end, Eve, who tries to decode
it from the pair $(h(r_b), l)$. The converse of Shannon's
noisy coding theorem says that the probability of correct
decoding can be bounded by $2^{-nG(R)}$, where $G(R)$ is a
positive function of the rate $R$ for $R > C$. So in some
sense we have estimated the capacity of this channel as

$$\frac{k_0}{n} + 1 - \frac{H(b)}{n}.$$

(For references on information theory, Shannon's
noisy coding theorem and its converse, see [5], [22], [32],
El, EH.)
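The capacity estimate can be read off from the dimension requirement of this section by dividing by the block length. As a sketch, with $R = k/n$ denoting the rate of the code and the bound read as non-strict:

```latex
\begin{align*}
k &\ge k_0 + \bigl(n - H(b)\bigr) \\
R = \frac{k}{n} &\ge \frac{k_0}{n} + 1 - \frac{H(b)}{n}.
\end{align*}
```

As a sanity check, for uniformly distributed biometrics $H(b) = n$ and the right-hand side reduces to $k_0/n$, the rate of a code of dimension $k_0$.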

V. Practical Implementation Issues 

The fact that, as far as we know, the fuzzy scheme
has not yet found many real applications in biometric
storage depends not only on the way of implementing it,
as discussed so far, but also on further practical
difficulties that make the problem more complicated than
we have stated it.

The main problem to overcome is the fact that the 
scheme requires that the two passwords to be compared 
are prealigned; and the difficulty consists in aligning with 
a password that is not in the clear. There are also some 
other aspects one has to improve or fix; for example 
one has to take into account the possibility of erasures 
and unordered collection of biometric features. The error 
distribution is also far from uniform in practical schemes. 

In the literature 0, 0, Q, QQ, (H, HI, 03, [35],
[37] we can find a deeper discussion of all these side
problems together with proposals to attack some of them,
each with its pros and cons. In the following
section we propose another way of dealing with the
alignment problem: we propose to use, instead of the
biometrics themselves, some particular
histograms derived from them that can capture important
features of the images. As a side effect, since these
histograms are also a means of compression, we would
obtain smaller lengths for the passwords to be hashed and
we would not need to require such a high minimum
distance. So looking for different and more convenient
code parameters could be a relevant consequence.

VI. Histograms and Alignment 

What we essentially want to do to solve the
prealignment problem is to transform the biometric
passwords and store the output of the transformation.
What we first require from this "function" is
to be resistant to noise, changes in illumination and
transformations such as translation and rotation. The
literature EL H21, ED indicates that the so-called
"multiresolution histograms", that is, sets of intensity
histograms of an image at multiple image resolutions,
satisfy these prerequisites. So they could possibly solve
our problem, but we require another important feature,
i.e. we want the transformation to be one-to-one or at
least that not too many different biometrics give the same
output. Pass and Zabih [26], [27] worked in this direction
and introduced the notions of histogram refinement and
joint histograms. We believe that some transformation
of this kind that encompasses these features could be
a solution to the problem of alignment.
New issues would consequently follow: the size of
error tolerance required (which would be much reduced)
and the choice of other suitable code parameters.
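To give a feeling for why histograms sidestep alignment, the following Python sketch computes intensity histograms of a small image at several resolutions and checks that they do not change under a 90 degree rotation. It is a toy illustration only: the image values are made up, resolution is halved by plain 2 x 2 block averaging (an assumption made here, not necessarily the construction used in the cited literature), and the image side is assumed to be a power of two.

```python
from collections import Counter

def histogram(image):
    """Intensity histogram: maps each pixel value to its number of occurrences."""
    return Counter(value for row in image for value in row)

def halve(image):
    """Halve the resolution by averaging disjoint 2 x 2 blocks (integer mean)."""
    size = len(image)
    return [
        [
            (image[i][j] + image[i][j + 1] + image[i + 1][j] + image[i + 1][j + 1]) // 4
            for j in range(0, size, 2)
        ]
        for i in range(0, size, 2)
    ]

def multiresolution_histogram(image, levels):
    """Histograms of a square image at `levels` successively halved resolutions."""
    result = []
    for _ in range(levels):
        result.append(histogram(image))
        if len(image) > 1:
            image = halve(image)
    return result

def rotate90(image):
    """Rotate a square image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

image = [
    [10, 10, 40, 40],
    [10, 20, 40, 50],
    [90, 90, 70, 70],
    [90, 80, 70, 60],
]

original = multiresolution_histogram(image, 3)
rotated = multiresolution_histogram(rotate90(image), 3)
print(original == rotated)
```

The comparison holds because a 90 degree rotation only permutes pixels and maps aligned 2 x 2 blocks to aligned 2 x 2 blocks. A plain histogram is far from one-to-one, however, which is the gap that refinements such as those of Pass and Zabih are meant to narrow.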